trio-sga: facilitating de novo assembly of highly heterozygous genomes with parent-child trios
Abstract
Motivation: Most DNA sequence in diploid organisms is found in two copies, one contributed by the mother and the other by the father. The high density of differences between the maternally and paternally contributed sequences (heterozygous sites) in some organisms makes de novo genome assembly very challenging, even for algorithms specifically designed to deal with these cases. Therefore, various approaches, most commonly inbreeding in the laboratory, are used to reduce heterozygosity in genomic data prior to assembly. However, many species are not amenable to these techniques.
Results: We introduce trio-sga, a set of three algorithms designed to take advantage of mother-father-offspring trio sequencing to facilitate better quality genome assembly in organisms with moderate to high levels of heterozygosity. Two of the algorithms use haplotype phase information present in the trio data to eliminate the majority of heterozygous sites before the assembly commences. The third algorithm is designed to reduce sequencing costs by enabling the use of parents’ reads in the assembly of the genome of the offspring. We test these algorithms on a ‘simulated trio’ from four haploid datasets, and further demonstrate their performance by assembling three highly heterozygous Heliconius butterfly genomes. While the implementation of trio-sga is tuned towards Illumina-generated data, we note that the trio approach to reducing heterozygosity is likely to have cross-platform utility for de novo assembly.
Availability: trio-sga is an extension of the sga genome assembler. It is available at https://github.com/millanek/trio-sga, written in C++, and runs multithreaded on UNIX-based systems.
Contact: [email protected], [email protected]
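The abstract states that trio data carry haplotype phase information that can be used to separate maternally and paternally inherited sequence before assembly. As a rough illustration of that general idea only, the following Python sketch assigns offspring reads to a parental haplotype using k-mers seen in just one parent. This is not the trio-sga implementation, whose details are not given here; the function names, the k-mer approach, and the toy reads are all assumptions, and the sketch ignores reverse complements and sequencing errors.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(mother_reads, father_reads, k):
    """Return (mother_only, father_only): k-mers seen in exactly one parent."""
    mother = set().union(*(kmers(r, k) for r in mother_reads))
    father = set().union(*(kmers(r, k) for r in father_reads))
    return mother - father, father - mother


def classify_read(read, mother_only, father_only, k):
    """Label an offspring read by which parent-specific k-mers it carries.

    Hypothetical rule: a read is assigned to a parent only if it contains
    k-mers specific to that parent and none specific to the other.
    """
    read_kmers = kmers(read, k)
    has_maternal = bool(read_kmers & mother_only)
    has_paternal = bool(read_kmers & father_only)
    if has_maternal and not has_paternal:
        return "maternal"
    if has_paternal and not has_maternal:
        return "paternal"
    return "unassigned"


# Toy data (made up): the parents differ at a single position.
K = 4
mother_only, father_only = parent_specific_kmers(["ACGTAC"], ["ACGTTC"], K)
labels = [classify_read(r, mother_only, father_only, K)
          for r in ["CGTAC", "CGTTC", "ACGT"]]
```

Reads that cover a site where the parents differ receive a parental label, while reads from regions shared by both parents stay unassigned. A real tool would also need to handle both strands, sequencing errors, and sites where a parent is itself heterozygous.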
Similar resources
Detection of de novo copy number alterations in case-parent trios using the R package MinimumDistance
For the analysis of case-parent trio genotyping arrays, copy number variants (CNV) appearing in the offspring that differ from the parental copy numbers are often of interest (de novo CNV). This package defines a statistic, referred to as the minimum distance, for identifying de novo copy number alterations in the offspring. We smooth the minimum distance using the circular binary segmentation ...
Efficient de novo assembly of large genomes using compressed data structures.
De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA...
Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads.
Although many de novo genome assembly projects have recently been conducted using high-throughput sequencers, assembling highly heterozygous diploid genomes is a substantial challenge due to the increased complexity of the de Bruijn graph structure predominantly used. To address the increasing demand for sequencing of nonmodel and/or wild-type samples, in most cases inbred lines or fosmid-based...
Identifying rare-variant associations in parent-child trios using a Gaussian support vector machine
As the availability of cost-effective high-throughput sequencing technology increases, genetic research is beginning to focus on identifying the contributions of rare variants (RVs) to complex traits. Using RVs to detect associated genes requires statistical approaches that mitigate the lack of power with the analysis of single RVs. Here we report the development and application of an approach ...